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Abstract — Computational inference of causal relationships un- 
derlying complex networks, such as gene-regulatory pathways, is 
NP-complete due to its combinatorial nature when permuting all 
possible interactions. Markov chain Monte Carlo (MCMC) has 
been introduced to sample only part of the combinations while 
still guaranteeing convergence and traversability, which therefore 
becomes widely used. However, MCMC is not able to perform 
efficiently enough for networks that have more than 15~20 nodes 
because of the computational complexity. In this paper, we use 
general purpose processor (GPP) and general purpose graphics 
processing unit (GPGPU) to implement and accelerate a novel 
Bayesian network learning algorithm. With a hash-table-based 
memory-saving strategy and a novel task assigning strategy, we 
achieve a 10-fold acceleration per iteration than using a serial 
GPP. Specially, we use a greedy method to search for the best 
graph from a given order. We incorporate a prior component 
in the current scoring function, which further facilitates the 
searching. Overall, we are able to apply this system to networks 
with more than 60 nodes, allowing inferences and modeling of 
bigger and more complex networks than current methods. 

Index Terms — Bayesian Networks; GPU; MCMC; Priors 

I. Introduction 

Bayesian network (BN) is a probabilistic graphical model 
that describes causal relationship through directed acyclic 
graphs (DAG). In this work, we focus on the problem of 
learning a Bayesian network that characterizes the causal rela- 
tionship from experimental data. This problem is proved to be 
NP-complete (TJ. Due to the large number of graph structures, 
sampling-based methods are proposed to find the best match- 
ing graph using a scoring metric. Several sampling methods 
have been proposed, including graph-space sampling, order- 
space sampling, and order-graph sampling (2). Among all 
these sampling methods, order-space sampling is demonstrated 
to be the best one (2). However, these methods are still not 
efficient enough for large graphs. In this paper, we proposed 
a new strategy for sampling the order space. Specifically, we 
apply a greedy strategy into the order sampling procedure. 
Our new method is more accurate than the previous ones 
while maintaining the same complexity in the sampling stage. 

t Corresponding author. 



Furthermore, it does not require a postprocessing part which 
is needed in the previous methods. 

In addition to experimental data, in many situations, prior 
knowledge for at least part of a given Bayesian network is 
available. Adding prior knowledge in the learning process can 
enhance the accuracy of the result while significantly reduce 
the searching space. Methods of adding priors include graph- 
based and probability-aimed [3 ]. However, the "pairwise" prior 
knowledge which indicates the likelihood of one event causing 
another is more easily obtained. Yet, no methods exist that can 
integrate such prior knowledge into the BN learning algorithm. 
In this work, we propose a novel prior component which can 
be easily added into our scoring function as a pairwise weight. 
It represents user's "confidence" in the possible existence or 
non-existence of an edge. 

Nevertheless, learning Bayesian interactions is still 
compute-intensive, demanding both software and hardware 
advancements. Novel computational platforms such as field- 
programmable gate array (FPGA) and graphics processing unit 
(GPU) have been applied to facilitate the learning of Bayesian 
networks (4J-J6). GPU is known to be highly efficient for 
massively parallel computational problems. In this work, we 
exploit the parallelism in our novel Bayesian network learning 
algorithm and implement it on GPU. The combination of our 
novel algorithm and its GPU implementation allows us to 
learn graphs with up to 60 nodes. 

The remainder of the paper is organized as follows. In 
Section 2, we introduce the background on the problem of 
learning Bayesian networks. In Section 3, we describe the 
improved learning algorithm. In Section 4, we demonstrate 
our novel method for adding priors. In Section 5, we discuss 
the implementation on GPU. In Section 6, we show the 
experimental results on the performance of our algorithms. 
We conclude the paper in Section 7. 

II. Background 

A Bayesian network G is a probabilistic graphical model 
that represents a set of random variables and their conditional 
dependencies via a directed acyclic graph (DAG). Each node in 



the graph is associated with a random variable. Each directed 
edge indicates a causal relationship between the variables 
connected by that edge. Nodes that are not connected represent 
random variables that are conditionally independent. A parent 
set TTi of a given node V{ is a set of nodes which have a 
directed edge pointed to V{. Each node vi is also associated 
with a probability distribution conditioned on all its parent 
variables, P(vi\iri). The joint probability distribution of all 
the random variables in a Bayesian network can be written as 
a product of the conditional distributions for all the nodes: 



P(V 1 ,V 2 , ...,V n ) = Y[P(Vi\7Ti) 



(1) 



An example of a Bayesian network is shown in Figure [TJa). 
Every node is influenced by its parents. For instance, node E 
in Figure [TJa) has two parents, A and C. Thus, the probability 
distribution of E is determined by the states of A and C. This 
conditional distribution P(E\A,C) is shown in Figure [TJd) 
for all combinations of A and C. 
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Fig. 1. An example of a Baysian Network, (a) The structure of a Baysian 
Network, (b) The distribution associated with node A. (c) The conditional 
probability distribution associated with node B. (d) The conditional probabil- 
ity distribution associated with node E. 



In this work, we focus on Bayesian network composed 
of discrete random variables, which is a common Bayesian 
network model. For example, we can model a gene network 
using a discrete Bayesian network, whose random variables 
are discretized into three states which represent the under- 
expression, the normal expression and the over-expression of 
genes, respectively. Mechanisms for discretizing continuous 
data include MDL method [7], CAIM, CACC, Ameva, and 
many others (8). 

The problem here is to learn the Bayesian network structure 
from its experimental data. There are two types of data: 
observational data and experimental data. Observational data 
are obtained through observations without any human pertur- 
bations in the experiment. Experimental data are generated 
from manipulating one or more variables, and then observing 
the states of the other variables (9). For example, in work 
(TO), experimental data are generated by individually applying 



inhibitors to some of the genes. Usually, the causal relationship 
cannot be inferred only from observational data; experimental 
data are required (2). In this work, we assume that each input 
data are sampled from multinomial distributions and the data 
set is complete. 

Due to its super-exponential complexity, sampling methods 
are recently applied to solve the problem of learning Bayesian 
networks. One of them is graph sampling, which explores the 
huge graph space for a best graph. Another is order sampling, 
which explores a smaller order space for a best order. And 
there is order-graph sampling (2), which samples graphs for a 
given order. Order refers to the topological order of a DAG, 
which is an order on the nodes of the DAG such that V{ must 
precede Vj if vi is in the parent set ttj of Vj. Each DAG has 
at least one topological order denoted as -<. For example, a 
topological order for the graph shown in Figure [TJa) is A -< 
B <C <D -<E. 

TABLE I 

The number of graphs and the number of topological orders 
versus different numbers of nodes. 
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The graph learning problem is an NP-complete problem. 
The number of possible graphs grows super-exponentially 
with the number of nodes. Table U shows the numbers of 
possible graphs for different numbers of nodes. Compared 
to the number of graphs, the number of orders for a given 
number of nodes is much smaller. Table [J also lists the 
numbers of orders for the same set of node numbers. Due 
to the reduced number of combinations, order sampler can 
converge in fewer steps compared to graph sampler and hence, 
reduce the overall complexity. Moreover, sampling in order 
space provides opportunities for parallel implementation. Due 
to these advantages, we develop our algorithm within the 
framework of order-space sampling. 

Learning Bayesian networks aims at finding a graph struc- 
ture which best explains the data. We can measure each 
different Bayesian graph structure with a Bayesian scoring 
metric, which is defined as (3): 

P{G,D)=P(G)P(D\G) 

where P(G) denotes the prior distribution of a graph and D 
denotes the experimental data. Given a Bayesian network with 
n nodes, using the decomposition relation shown in Equation 
([I]), we can represent the scoring metric as a product of n local 
scores P{vi,iTi\ D) as 



P(G,D) = Y[P(v i ,ir i ;D). 



(2) 



The local score P{vi^i\D) can be calculated as 

J~J F(a ik ) jj T(N ijk + oiijk) 



P(v h 7n-D)=^ 



L T(a ik + N ik ) 



n 



(3) 



where 7 serves as a penalty for complex structures (2), a is 
the hyperparameter for prior of Bayesian Dirichlet score, Ti 
refers to the number of different states of the parents set 7r;, 
and \vi\ refers to the number of possible states of the random 
variable vi. N ik and Nij k are counted from experimental data 
(3). T is the gamma function (TTJ. In order to reduce the 
overall complexity, we limit the maximal size of parent sets 
to a constant s. 

III. Algorithm Description 

In this section, we discuss our algorithm. The overall flow 
of our algorithm is plotted in Figure |2| while its pseudocode is 
shown in Algorithm [T] After the preprocessing step, we start 
scoring an order. The score is defined to be the best score of all 
the graphs satisfying that order. The scored order is accepted 
based on the Metroplis-Hasting rule p2| . We then apply the 
Markov chain Monte Carlo (MCMC) strategy to sample the 
order space: each new order is generated from the previous 
one by randomly selecting and swapping two nodes in the 
previous order. We sample the orders for a specified number 
of iterations. Each subroutine of our algorithm is discussed in 
detail in the following sections. 
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Fig. 2. The whole process of BN learning algorithm. 



A. Preprocessing 

As shown in Figure [2] our learning algorithm is started 
with a preprocessing part, which includes order initialization 
and the generation of all possible local scores (refer to 
Equation ([3])). The order initialization randomly generates an 
initial order as the starting point. As we will show in Section 
III-B| the scoring part heavily relies on the computation of 



local scores. Indeed, each local score is repeatedly used in 
a large number of iterations. However, calculating the local 
score according to Equation ^ is time-consuming. Thus, 
instead of recomputing local scores each time when they are 
needed, we choose to compute local scores for all the possible 
combinations of the node and its parent set at the preprocessing 
stage. We store the result in a hash table keyed by the node Vi 
and the parent set 7^. Later on, when a local score for a specific 
combination of a node and its parent set is needed, we just 
fetch the score from the hash table. This strategy leads to more 
than 10 folds speedup on GPP according to our experimental 
results. 

Since the local score shown in Equation ^ is very small, 
we perform the computation in the log- space. Given a node 



Vi and its parents set 7^, equation for a local score (Is) is now 
changed to: 



ls(i,7n) =iog lo7 W + 2 



k=i 



log 10 T(a ik ) - log 10 T(a ik + N ik ) 



\Vi\ 



+ ^(log 10 T(N ijk + a ijk ) - log 10 T(a ijk )) 



(4) 



The scoring function for a graph is changed to: 

n 

P(G,D)cxJ2Hh^i) (5) 

B. Scoring 

Algorithm 1 Algorithm for our novel BN Learning algorithm. 
1: Preprocess() 

2: for 1 to iteration_num do 
3: for every node Vi in an order do 
4: maxLs <= —LargeNumber 
5: for each parent set m consistent with this order do 
6: if maxLs < ls(i,7Ti) then 

7: maxLs <= ls(i,7Ti) 

8: bestParents <^= m 

9: end if 

10: end for 

11: bestGraph.insert(i, bestParents) 
12: score <^= maxLs + score 
13: end for 

14: Metropolis-Hasting-Comparison() 
15: Best-Graph-Updating() 
16: Order-Generation() 
17: end for 

18: return globalBestGraph 

The scoring part is a major subroutine of our algorithm, 
which scores a given order. To effectively measure an order, 
we introduce a new scoring function different from the one 
proposed in [5 ]. Given a specific order, there are many graphs 
that satisfy that order. We define the score of an order to be 
the best score for all the graphs satisfying the order, i.e., 

P(D,^) oc maxP(G,D) 
Based on Equation ([5]), we further have 

n 

P(D, -<) oc max ^ ls(i, 7Ti) 

i=l 

Due to the Markov property of Bayesian networks, global 
maximum equals to the sum of the maximal local scores of all 
the nodes, each of which is taken among all the combinations 
of the node and its parent sets that are consistent with the order. 
Mathematically, the scoring function can be represented as 



P(D, -<) oc m ^ ls(i,7Ti) 



(6) 



where P ni is the set of all possible parent sets of the node 
Vi that are consistent with the order. The scoring subroutine 
is shown at Line 3 ~ 13 in Algorithm [T] We notice that a 



similar algorithm was previously mentioned in [13]. However, 
it is only used in the postprocessing part where a best graph 
is constructed from the best order. In (5), a different order 
scoring function was used, which is the sum of all the scores 
of the graphs that are consistent with the order. Compared to 
that scoring function, ours is better in the following ways: 

• Our algorithm only needs comparison and assignment op- 
erations, avoiding the time-consuming exponentiation and 
logarithm operations required by the previous algorithm. 

• The sum-based scoring function may lead to an incorrect 
result, because the best matching graph may not be con- 
sistent with the order which generates the largest score. 
However, since our function uses the max operation, the 
globally best graph must be consistent with the globally 
best order. 

• The previous algorithm needs a postprocessing part which 
constructs the best graph from the best order. Our algo- 
rithm generates the best graph for each sampled order. 
Thus, we do not need any postprocessing. 

In summary, since our algorithm avoids many expensive 
operations and reduces a large amount of computation, the 
total computation time is decreased. 

In (4| and (5), bit vectors are used to generate every 
compatible parent set with respect to a given order. However, 
our experimental results indicate that bit vector is not a 
suitable implementation since it is very slow. According to our 
experiment, the bit vector implementation consumes a huge 
amount of time for networks with more than 20 nodes. This 
is because for the last node in an order, each of the n — 1 
nodes preceding it could be its parent. Therefore, we need to 
compare 2 n_1 bit vectors to filter out the compatible parent 
sets for the last node. However, we notice that in practice the 
maximal size of a parent set is limited to a constant s <C n. 
Given this, we only need to consider X^=o ( n_1 ) potential 
parent sets for the last node, which is much smaller than 2 n . 
Table |II] compares the runtime for generating all 2 n parent sets 
with the runtime for generating only those parent sets with a 
size limit of 4. We compare these runtimes (per iteration) for 
different numbers of the candidate parents ranging from 15 to 
25. We can see that there is a dramatic increase in speed if 
we only generate those parent sets with a size limit of 4. 

TABLE II 

Runtime per iteration comparison between generating all 
possible parent sets with generating only parent sets with a 
size limit of 4. 



Number of 
Candidate Parents 


Generating all 
possible parent 
sets (Sec.) 


Generating parent 
sets with a size limit 
of 4 (Sec.) 


Speedup 


15 


0.011 


1 x 10~ b 


1100 


16 


0.017 


1.29 x 10~ 5 


1317 


17 


0.065 


1.66 x 10~ 5 


3915 


18 


0.104 


2.38 x 10~ 5 


4369 


19 


0.195 


2.86 x 10~ 5 


6818 


20 


0.297 


2.93 x 10- 5 


10136 


21 


0.645 


4.04 x 10- 5 


15965 


22 


1.248 


4.48 x 10~ 5 


27857 


23 


3.425 


5.43 x 10~ 5 


63075 


24 


6.814 


5.88 x 10- 5 


115884 


25 


12.185 


7.51 x 10- 5 


162250 



C. Metroplis -Hasting Comparison, Best Graph Updating, and 
Order Generation 

We apply the Markov chain Monte Carlo method (MCMC) 
to sample the order space, which essentially performs a 
random walk in that space. Each time a new order is proposed, 
even if its score is less than the score of the previous order, 
it still could be accepted based on the Metropolis-Hasting 
rule (12), which is to accept the new order with the probability 



p — min[l 



Suppose that the log-space score for the new order -< new 
and that for the previous order -< are score(-< new ) and 
score(^), respectively. The new order is accepted if 



log('u) < score(-< r 



score(-<), 



where u is a random number generated uniformly from the 
unit interval [0, 1]. Due to the property of MCMC, After a 
sufficient number of iterations, the Markov chain will converge 
to its steady distribution. At that time, each order is sampled 
with a frequency proportional to its posterior probability. Thus, 
an order with a high probability of occurring (corresponding 
to a high Bayesian score) is very likely to be sampled. 

Our ultimate goal is to find the graph with the highest score. 
Therefore, we keep track of a number of best graphs obtained 
so far as the sampling procedure proceeds. At the end of each 
iteration, if a new order is accepted, then we compare the score 
of the best graph consistent with that order to the scores of 
the best graphs recorded so far. We update the record of the 
best graphs if the current graph is better. 

At the end of each iteration, we generate a new 
order by randomly selecting two nodes V{ and Vj 
in the current order and swapping them, i.e., chang- 
ing the order (yi, • • • , • • • , • • • , v n ) to the order 



Oir 



, Vj , • • • , V{ , • • 



>V n ). 



IV. Priors for Characterizing Pairwise 
Relationship 



In this section, we present our novel prior component that 
could effectively characterize the prior knowledge on the 
causal relationship between a pair of nodes. 

Assume that a function p(i, m) indicates the prior knowl- 
edge on the causal relationship between a pair of nodes Vi and 
v m . Equivalently, p(i,m) represents the prior knowledge on 
the existence of an edge from v m to vi. We add p(i, m) into 
the scoring Equation ^ to affect the posterior probability of 
graphs as follows 



P(G,D) = l[^ JT p{i,m)\[ 



L T(a ik + N ik ) 



\vi\ 



T(N ijk 



a, 



ijk ) 



(7) 



Note that in the above equation, given an arbitrary graph, 
the prior probabilities on all the edges are multiplied together 
to influence the posterior probability of the graph. Thus, if the 
prior probability on the existence of an edge in the Bayesian 
network is large, the probabilities of the graphs containing that 



edge will be increased and hence, these graphs are more likely 
to be sampled. In the log-space, Equation ^ becomes 



PPF{i,m) = 100(/?[i,m] - 0.5) 3 



(8) 



where ls(i,iri) is the local score in Equation ([4]). We call 
log 10 p(i,m) as the pariwise prior function (PPF) for the 
nodes V{ and v m . It is also denoted as PPF(i,m). Thus, 
Equation (9) becomes 



P(G,D)o:J2 



i=i 



ls(i,TTi)+ PPF{i,m) 



m^TTi 



(9) 
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• when R[i,m] 

• when R[i,m] 
where 10 and 



With this general form of adding pairwise priors, we can 
meet different needs by applying different PPFs. 

In our design, we provide an interface for users. It is an n x n 
matrix R, where n is the number of nodes in the graph. Each 
entry in the matrix R is between zero and one. If the value 
R(i,m) is between and 0.5, it means that there unlikely 
exists an edge from v m to vi\ if the value R(i, m) is between 
0.5 and 1, then it means there likely exists an edge from v m 
to Vi, if the value R(i,m) is 0.5, it means that there is no 
bias on whether or not there exists such an edge from v m to 
Vi. This interface provides a convenient way to specify the 
pairwise priors. The actual PPF is a function on the value in 
the matrix R. It must satisfy the following requirements: 

. PPF(i, m) = iff R[i, m] 0.5 

. PPF(i, m) > iff R[i, m] > 0.5 

. PPF(i, m) < iff R[i, m] < 0.5 
Furthermore, according to our experiment results, PPF should 
also satisfy: 

approaches 1, PPF(i,m) is around 10 
approaches 0.5, PPF(i,m) approaches 
approaches 0, PPF(i,m) is around —10 
-10 are chosen empirically to have a significant 
impact on the ultimate score of a graph. 

Based on the above-mentioned requirements, we propose 
the following cubic polynomial to transform the value in the 
interface matrix R into the PPF: 

PPF(i, m) = 100(R[i, m] - 0.5) 3 (10) 

The above function is plotted in Figure [3] to give a clear view. 

V. Implementation of the Learning Algorithm on 

GPU 

In this section, we discuss the implementation of our 
algorithm on GPU for learning Bayesian networks. 

A. The Architecture of GPU 

Figure [4] shows the architecture of a typical GPU. Host 
refers to a CPU, which assigns tasks to and collects results 
from the GPU. As we show in Figure [5] the GPU implements 
the scoring part of our algorithm, since the max operation can 
be paralleled both within each node and across all the nodes 
(refer to Equation ([6])). The remaining parts of our learning 
algorithm are handled by the CPU. The CPU also takes charge 
of the communication with the GPU. Specifically, it passes a 




R[i,m] 



Fig. 3. Our proposed pairwise prior function with respect to the value in the 
interface matrix. 




Fig. 4. The architecture of GPU 



new order to the GPU and gets the best graph and its score 
from the GPU, as shown in Figure [4] 

A GPU contains a number of blocks connected in the form 
of a grid. Each block usually includes 256 threads. Each thread 
has a number of registers and a local memory. All the threads 
within a block can access the shared memory of that block. 
All the threads can also access the global memory of the GPU. 

The GPU we use is based on Fermi architecture (14). Fermi 
architecture provides true cache hierarchy for us to use the 
shared memory of GPU. Also, it is fast in context switching 
operation and the atomic operations of read-modify-write for 
parallel algorithms. Fermi architecture has up to 16 streaming 
multiprocessors (SM) with each containing 32 CUDA cores. 
Thus, it features up to 512 CUDA cores. A CUDA core 
executes a floating point or an integer instruction per clock 
for a thread. The GPU has 6 x 64-bit memory partitions for a 
384-bit memory interface, supporting up to a total of 6 GB of 
GDDR5 DRAM memory. The SM schedules threads in groups 
of 32 parallel threads called warps. Each SM features two 



warp schedulers and two instruction dispatch units, allowing 
two warps to be issued and executed concurrently. 

B. Task Assigning Strategy 

GPU implements the order scoring part of the Bayesian 
network learning algorithm. This requires us to assign the tasks 
evenly among all the blocks and all the threads. We describe 
our task assigning strategy in this section. 




Fig. 5. The implementation of the order scoring part for the BN learning 
algorithm on GPU. 

First, we assign h blocks for each node. These h blocks 
together will get the maximal local score max^.^p^ ls(i,7Ti) 
for the node V{ (refer to Equation ([6])). The number of local 
scores they need to compare equals to the size of the set P 7Vi , 
or the number of parent sets of the node Vi that are consistent 
with the given order -<. Now we will assign these \P W . | parent 
sets evenly to all the threads in the h blocks. Assume that the 
total number of threads in the h block is T. Then, each thread 
will handle \P n .\/T local scores and get the "local" maximum 
among them. After that, we will further compare all the local 
maximal scores obtained by the threads and get the largest 
one. We need to assign to each thread the parent sets they are 
in charge of. 

Since each thread has a thread ID and a block ID in the 
CUDA programming environment, we can assign a specific 
task to a thread based on its ID. The problem is how a 
thread predicts the parent sets that it needs to handle. This 
corresponds to predicting which parent set 7^ the fc-th thread 
needs to lookup in the hash table to get the local score ls{i^i). 
This problem can be converted into a subset indexing problem: 
given a set of n nodes, we want to index all the subsets with 
at most s nodes in a regular way so that given an arbitrary 
valid index we can easily get the corresponding subset. Note 
that the total number of the subsets with at most s nodes is 
S = J2 S j=o ( U j ) • Indeed, we can index these subsets in a regular 
way. For a example, consider a set of nodes {0, 1,2, 3, 4, 5}. 
If the size limit on the subsets is 4, then we can obtain 
the total number of subsets as S = X^=o(j) = *>7. We 
assign index to the subset {0, 1, 2, 3}, index 1 to the subset 
{0,1,2,4}, index 2 to the subset {0,1,2,5}, index 3 to the 



subset {0, 1, 3, 4}, index S — 2 to the subset {5}, and index 
S — 1 to the subset <\>. 

Now the problem is how to recover the subset from a given 
index if we use the above indexing method. We propose an 
algorithm shown in Algorithm [2] to solve this problem, which 
is inspired by an algorithm proposed in (T5). Since GPU 
cannot support recursive functions, we provide a non-recursive 
version. Given the number of candidate parents n, the size of 
parent sets k, and the index of the expected parent set I, it 
can return the Z-th parent set which is composed of k nodes 
chosen from the n candidates. 



Algorithm 2 Algorithm for obtaining the l-th /c-combination of n 
elements in lexicographic order. 

1: {Given three integers n, k, and /} 

2: low <£= 

3: {Compute the element for each position pos in the k- 

combination} 

4: for pos — 1 to k — 1 do 

5: {Compute the shift s} 

6: sum <^= 

7: for s — 1 to n do 

8: if sum + < I then 

9: sum <^= sum + (^-1) 

10: else 

11: break 

12: end if 

13: end for 

14: comb[pos] <^= low + s 

15: {Update parameters for obtaining the next element} 

16: n n — s 

17: k<=k-l 

18: / <^= / — sum 

19: low <^= comb[pos] 

20: end for 

21: comb[k] <^= low + I 

22: return comb; 



Our purpose is to compute the k -combination of n elements 
that is at a given position I in the lexicographic order, without 
explicitly counting them one by one. The solution is quite 
straightforward. Suppose that the n elements are 1, 2, . . . , n. 
We obtain each element in the /c-combination (ai, a2, . . . , a^) 
one by one from the first to the last. We assume that the 
elements in each combination are in increasing order from 
the first to the last, i.e, a\ < 0,2 < • • • < a&. With this 
assumption, we can see that there are (^~™) ^-combinations 
beginning with the value m (m = 1, 2, . . . , n — k + 1). 
Based on this fact, we can obtain the first element a\ as 
the largest number such that sum = Y^tLi ik-l) — ^ , 
In order to get the second element G&2, it is equivalent to 
obtaining the (fc — 1) -combination of (n — a\) elements at 
the position (I — sum). Thus, CL2 is the largest number s such 
that Y%=i ^ ( l ~ sum), plus the shift a i? namely 

d2 = a\ + s. We compute all the remaining elements in the 
combination in a similar way. 

With Algorithm [2] each thread can get the first parent set 
it needs to handle based on its ID. With this, the remaining 
parent sets it needs to handle can be obtained incrementally. 

However, the above algorithm requires additional compu- 
tation on GPU. Our second strategy is to create a parent set 
table (PST) and store all the combinations in the the global 
memory of the GPU. Figure [6] shows an example of the PST 



and the additional memory requirement for storing the PST. 
Suppose that we have in total T threads to handle S parent 
sets. Then, each thread handles ^ parent sets. Therefore, the 

z-th thread should handle the ^-th upto the ( ' + ^ x5 -th 
parent sets from the PST. Compared to the above-mentioned 
combinatorial algorithm, PST-based method is much faster 
since it only needs to read the table. Although it requires 
additional memory, the overhead is small. Indeed, as shown in 
Figure |6jb), a 60-node graph only costs 7.99 MB additional 
memory when the size limit on the parent set is s = 4. Using 
the PST and a proper mapping strategy, we can assign to each 
thread the parent sets it is responsible for. 
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Fig. 6. An example of a parent set table and its additional memory 
requirement, (a) The PST for a set of candidate parents {0,1,2,3,4,5}. 
The size of the subset is limited to 4. (b) The additional memory requirement 
for the PST versus the size of the candidate parent set. 
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Fig. 7. An illustration of the reduction algorithm to find the highest score 
in the shared memory. 

After completing its task, each thread stores its best parent 
set and the best score in a shared memory within each block. 
We further need to find the best score and the best parent set 
among all choices stored in the shared memory. In order to do 
this efficiently, we modified a reducing algorithm mentioned 
in p6) . Each thread has kept its local best parent set and the 
corresponding local best score. The problem is to pick the 
highest score among all the local best scores as well as its 
corresponding parent set. This problem is not as simple as 
the problem of searching the highest score since we have to 
recover its original position during a highly dynamic process. 
We have to consider both the efficiency and the correctness. 
An illustration of our algorithm is shown in Figure [7] Assume 
that a shared memory has 16 entries. We want to move the 



highest score to the entry of the array and record the ID 
of the thread that gives the highest score in entry 1. In this 
example, the highest score is —1 and the thread ID is 3. In 
the first reduction, we first divide the array into two halves. 
We move the higher scores to the left half and record their 
original thread IDs in the right half as shown in the third row 
of Figure [7] It requires half of the threads to participate. For 
example, thread compares its value with the value in entry 
8. Then, thread assigns entry of the shared memory with 
value —3 and entry 8 with 0, which is the ID of the thread 
giving the larger value —3. Each reduction halves the amount 
of memory involved. 

In the rest of the reductions, we have to keep track of the ID 
of the original thread that gives the better value. For example, 
in the second reduction, entry 2 of the shared memory has to 
be compared with entry 6 of the shared memory. Since —2 
is larger than —9, —2 is stored in entry 2. Note that —2 is 
now from entry 6. However, the ID of the original thread that 
gives the value —2 is store in entry 6 + 8 = 14, where 8 is the 
current number of threads involved. Then, we update entry 6 
with the original thread ID by copying the value of entry 14 
to entry 6. The total number of iterations required to get the 
best score among a total of n scores is log 2 n. After obtaining 
the ID of the original thread that gives the highest score, we 
can fetch the best parent set from that thread. 

VI. Experimental Results 

We perform experiments on our algorithm both on GPP 
and GPU. The GPP we use is a 2.4 GHz Intel Xeon E5620 
processor with 8GB RAM. The GPU we use is an NVIDIA 
Tesla M2090 GPU with 6GB GDDR5 RAM. Our GPU-based 
implementation is described in Figure [5] with its scoring part 
performed on the GPU and the remaining parts performed on 
the CPU. The operating system is Ubuntu 10.04.4. 

In our experiments, we set the maximal size of the parent 
set as 4 (s = 4). In our implementation, the GPU is used 
to accelerate each order scoring iteration. We first study its 
speedup effect. Figure [8] shows both the runtime of our scoring 
implementation on GPP and the runtime of the implementation 
on GPU for different graph sizes. From Figure [8] we can see 
that the GPU implementation achieves a significant speedup 
over the GPP implementation. The detailed runtimes per 
iteration for both the GPP and the GPU implementations, 



together with the acceleration rates, are listed in Table [III 
Acceleration rate is peaked at 10 for graphs with around 50 
nodes. For smaller graphs, i.e., graphs with fewer than 13 
nodes, their acceleration rates are less than 1. That is due to 
the time consumed on the context switching on GPU. As a 
result, GPU is not a good choice for small graphs. 

To make the results more practical, we further apply our 
learning algorithm to two well-known networks : 1) an 11- 
node signaling transduction network (STN) from human T-cell 
(TO); and 2) a 37-node ALARM network (17). 



Table IV shows the runtimes for both the 11 -node graph 
and the 37-node graph. Note that the preprocessing part of 
the GPU implementation is done on a CPU, as we mentioned 
before. The GPU-based implementation takes more time on 
preprocessing than the GPP-based implementation. Still, the 
total runtime of the GPU-based implementation is about 1/3 
of the runtime of the GPP-based implementation for the large 
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Fig. 8. Average runtimes per iteration for both the GPP and the GPU 
implementations. 



37-node network. Scoring orders is the most time-consuming 
part of the Bayesian network learning algorithm. Accelerating 
scoring subroutine is our primary goal in this work. We will 
study how to speedup the preprocessing part in our future 
work. 

We also compare the implementation that generates all 
possible parent sets with the implementation that generates 
only parent sets with a limited size. We evaluate these two 
implementations on GPP using the previous 11 -node graph 
and a randomly synthesized 20-node graph. We do not use the 
37-node graph because the generation of all the possible parent 
sets is prohibitively time-consuming. The runtime results are 
shown in Table fVl From the table, we can see that for both 



TABLE III 

Average runtimes per iteration for the GPP and the GPU 

IMPLEMENTATIONS AND THE SPEEDUPS OF THE GPU IMPLEMENTATION 
OVER THE GPP IMPLEMENTATION. 



# of 


GPP time per 


GPU time per 


Speedup of 


Nodes 


iteration (sec.) 


iteration (sec.) 


GPU over GPP 


13 


0.00024 


0.000461 


0.52 


15 


0.000574 


0.000511 


1.12 


17 


0.001223 


0.000645 


1.90 


20 


0.003384 


0.001053 


3.18 


25 


0.013076 


0.002059 


6.35 


30 


0.045229 


0.005027 


9.00 


35 


0.132726 


0.012319 


10.77 


40 


0.294657 


0.027673 


10.65 


45 


0.600787 


0.056061 


10.71 


50 


1.074849 


0.102469 


10.49 


55 


1.7365 


0.177313 


9.79 


60 


3.267 


0.3427 


9.53 



graphs, the total acceleration rate is almost 300% when we 
only generate parent sets with a limited size. The speedup in 
the preprocessing part is not so significant for the 11 -node 
graph, while it is more than 3 times faster for the 20-node 
graph. 

We also empirically study the accuracy of our algorithm. 
We use the receiver operating characteristic (ROC) curve 
introduced in p8| to measure the accuracy. A ROC curve is a 
plot of the true positive (TP) rate versus the false positive (FP) 
rate. True positive rate gives the fraction of true positives out 
of the observed positives, while false positive rate gives the 
fraction of false positives out of the observed negatives. The 
closer to the upper-left point (0, 1), the more accurate is the 
graph learning result. We tried a 20-node graph with 1,000 
and 10,000 iterations separately. The ROC curves for these 
two experiments are shown in Figure [9] and 10 respectively. 
Clearly, the resulting curve with 10,000 iterations is closer 
to the upper-left corner than the resulting curve with 1,000 
iterations. However, the curve with 1,000 iterations is pretty 
closer to the upper-left corner. It indicates that our algorithm 
is highly accurate with even a small number of iterations. In 
these two figures, the points from the right to the left are 
generated as follows: the first point is obtained without adding 
any prior knowledge on edges; the second point is obtained 
by assigning "interface" prior value (refer to Section |TV ) 
0.7/0.2 with a probability of 0.2 to edges which are mistakenly 
removed/added when learned without any prior knowledge; the 
third point is obtained by adding the same prior knowledge 
used in generating the second point but with a probability 
of 0.4; the fourth point is obtained by assigning "interface" 
prior value 0.8/0.1 with a probability of 0.2 to edges which 
are mistakenly removed/added when learned without any prior 
knowledge; the fifth point is obtained by adding the same 
prior knowledge used in generating the fourth point but with a 
probability of 0.4. Note that the priors added becomes stronger 
as we generate the points from the first to the last. 

In realistic situations, the observed data may contain a large 
amount of noise and hence become faulty. In order to learn 
BNs correctly in these situations, the algorithm must be highly 
tolerant to noise. We study the fault tolerance of our algorithm 
by injecting errors into the data. We test our algorithm in 
learning Bayesian networks with two states. In this case, we 
assume that each data has a probability p to flip its state. That 
is, every single data would change from 1 to or from to 
1 with a rate of p. In realistic context, this means that every 



TABLE V 

Runtimes for the implementation that generates all the 
possible parent sets and the implementation that generates 
only parent sets with a limited size. both are implemented on 

GPP. 



TABLE IV 

Runtimes of the GPP and the GPU implementations on an 

1 1 -NODE NETWORK AND A 37-NODE NETWORK. 
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Fig. 9. A ROC curve for learning a 20-node graph from 1,000 observed Fig. 11. A ROC curve for learning a 20-node graph from 1,000 observed 



data. Our learning algorithm samples the order space 10,000 times. 



data with different rates of fault injection. Our learning algorithm samples the 
order space 10,000 times. 
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Fig. 10. A ROC curve for learning a 20-node graph from 1,000 observed 
data. Our learning algorithm samples the order space 1,000 times. 



data has a possibility to be overestimated or underestimated. 
For p chosen as 0.01, 0.05, 0.06, 0.07, 0.08, 0.1, 0.11, 0.13 
and 0.15, the accuracy of our algorithm is shown in Figure 11 
in the form of a ROC curve. When p = 0.15, TP is 0.513514, 
which is not good enough. However, for p < 0.07, the results 
are acceptable in most applications. These results show that 
our algorithm has a relatively high noise tolerance. 

VII. CONCLUSION 

Learning Bayesian network structure from experimental 
data is a computational challenging problem. In this paper, we 
have demonstrated a novel BN learning algorithm and its im- 
plementation on GPU. Our proposed algorithm is three times 
faster than the traditional algorithm when run on GPP. Further, 
our GPU implementation has achieved a 10-fold speedup 
per iteration over the GPP implementation. When the entire 
learning procedure is considered, the GPU implementation 



has a 3-fold speedup. Overall, we have accelerated the BN 
learning algorithm at least 9 folds. Experimental results also 
demonstrated that our algorithm gives accurate result and is 
highly tolerant to errors in the data. 

Our algorithm is an improved version over the one proposed 
in (5j. We have proposed a better method for scoring the 
order based on the best graph consistent with the order. We 
have also introduced a new way of adding pairwise priors 
to enhance the accuracy of learning Bayesian networks. In 
addition, we have proposed two strategies for distributing the 
scoring tasks evenly among a given number of threads in 
GPU. In our current implementation, we take advantage of 
the parallelism in the order scoring part and accelerate that 
part using GPU. In our future work, we will study how to 
accelerate the preprocessing part using GPU. 
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